{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<i>Copyright (c) Recommenders contributors.</i>\n",
                "\n",
                "<i>Licensed under the MIT License.</i>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# xDeepFM : the eXtreme Deep Factorization Machine \n",
                "This notebook will give you a quick example of how to train an [xDeepFM model](https://arxiv.org/abs/1803.05170). \n",
                "xDeepFM \\[1\\] is a deep learning-based model aims at capturing both lower- and higher-order feature interactions for precise recommender systems. Thus it can learn feature interactions more effectively and manual feature engineering effort can be substantially reduced. To summarize, xDeepFM has the following key properties:\n",
                "* It contains a component, named CIN, that learns feature interactions in an explicit fashion and in vector-wise level;\n",
                "* It contains a traditional DNN component that learns feature interactions in an implicit fashion and in bit-wise level.\n",
                "* The implementation makes this model quite configurable. We can enable different subsets of components by setting hyperparameters like `use_Linear_part`, `use_FM_part`, `use_CIN_part`, and `use_DNN_part`. For example, by enabling only the `use_Linear_part` and `use_FM_part`, we can get a classical FM model.\n",
                "\n",
                "In this notebook, we test xDeepFM on [Criteo dataset](http://labs.criteo.com/category/dataset)."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 0. Global Settings and Imports"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "System version: 3.7.13 (default, Oct 18 2022, 18:57:03) \n",
                        "[GCC 11.2.0]\n",
                        "Tensorflow version: 2.7.4\n"
                    ]
                }
            ],
            "source": [
                "import os\n",
                "import sys\n",
                "from tempfile import TemporaryDirectory\n",
                "import tensorflow as tf\n",
                "tf.get_logger().setLevel('ERROR') # only show error messages\n",
                "\n",
                "from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources, prepare_hparams\n",
                "from recommenders.models.deeprec.models.xDeepFM import XDeepFMModel\n",
                "from recommenders.models.deeprec.io.iterator import FFMTextIterator\n",
                "from recommenders.utils.notebook_utils import store_metadata\n",
                "\n",
                "print(f\"System version: {sys.version}\")\n",
                "print(f\"Tensorflow version: {tf.__version__}\")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "#### Parameters"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 2,
            "metadata": {
                "tags": [
                    "parameters"
                ]
            },
            "outputs": [],
            "source": [
                "EPOCHS = 10\n",
                "BATCH_SIZE = 4096\n",
                "RANDOM_SEED = 42  # Set this to None for non-deterministic result\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "xDeepFM uses the FFM format as data input: `<label> <field_id>:<feature_id>:<feature_value>`  \n",
                "Each line represents an instance, `<label>` is a binary value with 1 meaning positive instance and 0 meaning negative instance. \n",
                "Features are divided into fields. For example, user's gender is a field, it contains three possible values, i.e. male, female and unknown. Occupation can be another field, which contains many more possible values than the gender field. Both field index and feature index are starting from 1. <br>"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "100%|██████████| 10.3k/10.3k [00:01<00:00, 7.96kKB/s]\n"
                    ]
                }
            ],
            "source": [
                "tmpdir = TemporaryDirectory()\n",
                "data_path = tmpdir.name\n",
                "yaml_file = os.path.join(data_path, r'xDeepFM.yaml')\n",
                "output_file = os.path.join(data_path, r'output.txt')\n",
                "train_file = os.path.join(data_path, r'cretio_tiny_train')\n",
                "valid_file = os.path.join(data_path, r'cretio_tiny_valid')\n",
                "test_file = os.path.join(data_path, r'cretio_tiny_test')\n",
                "\n",
                "if not os.path.exists(yaml_file):\n",
                "    download_deeprec_resources(r'https://recodatasets.z20.web.core.windows.net/deeprec/', data_path, 'xdeepfmresources.zip')\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## 2. Criteo data \n",
                "\n",
                "Now let's try the xDeepFM on a real world dataset, a small sample from [Criteo dataset](http://labs.criteo.com/category/dataset). Criteo dataset is a well known industry benchmarking dataset for developing CTR prediction models and it's frequently adopted as evaluation dataset by research papers. \n",
                "\n",
                "The original dataset is too large for a lightweight demo, so we sample a small portion from it as a demo dataset."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Demo with Criteo dataset\n",
                        "HParams object with values {'use_entity': True, 'use_context': True, 'cross_activation': 'identity', 'user_dropout': False, 'dropout': [0.0, 0.0], 'attention_dropout': 0.0, 'load_saved_model': False, 'fast_CIN_d': 0, 'use_Linear_part': True, 'use_FM_part': False, 'use_CIN_part': True, 'use_DNN_part': True, 'init_method': 'tnormal', 'init_value': 0.1, 'embed_l2': 0.01, 'embed_l1': 0.0, 'layer_l2': 0.01, 'layer_l1': 0.0, 'cross_l2': 0.01, 'cross_l1': 0.0, 'reg_kg': 0.0, 'learning_rate': 0.002, 'lr_rs': 1, 'lr_kg': 0.5, 'kg_training_interval': 5, 'max_grad_norm': 2, 'is_clip_norm': 0, 'dtype': 32, 'optimizer': 'adam', 'epochs': 10, 'batch_size': 4096, 'enable_BN': False, 'show_step': 200000, 'save_model': False, 'save_epoch': 2, 'write_tfevents': False, 'train_num_ngs': 4, 'need_sample': True, 'embedding_dropout': 0.0, 'EARLY_STOP': 100, 'min_seq_length': 1, 'slots': 5, 'cell': 'SUM', 'FIELD_COUNT': 39, 'FEATURE_COUNT': 2300000, 'data_format': 'ffm', 'load_model_name': 'you model path', 'method': 'classification', 'model_type': 'xDeepFM', 'dim': 10, 'layer_sizes': [20, 20], 'activation': ['relu', 'relu'], 'cross_layer_sizes': [20, 10], 'loss': 'log_loss', 'metrics': ['auc', 'logloss']}\n"
                    ]
                }
            ],
            "source": [
                "print('Demo with Criteo dataset')\n",
                "hparams = prepare_hparams(yaml_file, \n",
                "                          FEATURE_COUNT=2300000, \n",
                "                          FIELD_COUNT=39, \n",
                "                          cross_l2=0.01, \n",
                "                          embed_l2=0.01, \n",
                "                          layer_l2=0.01,\n",
                "                          learning_rate=0.002, \n",
                "                          batch_size=BATCH_SIZE, \n",
                "                          epochs=EPOCHS, \n",
                "                          cross_layer_sizes=[20, 10], \n",
                "                          init_value=0.1, \n",
                "                          layer_sizes=[20,20],\n",
                "                          use_Linear_part=True, \n",
                "                          use_CIN_part=True, \n",
                "                          use_DNN_part=True)\n",
                "print(hparams)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 6,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Add linear part.\n",
                        "Add CIN part.\n",
                        "Add DNN part.\n"
                    ]
                },
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "2022-11-16 11:30:58.632305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15397 MB memory:  -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0001:00:00.0, compute capability: 6.0\n"
                    ]
                }
            ],
            "source": [
                "model = XDeepFMModel(hparams, FFMTextIterator, seed=RANDOM_SEED)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [
                {
                    "name": "stderr",
                    "output_type": "stream",
                    "text": [
                        "2022-11-16 11:31:03.488364: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8401\n",
                        "2022-11-16 11:31:03.969312: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory\n"
                    ]
                },
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "{'auc': 0.4728, 'logloss': 0.7113}\n"
                    ]
                }
            ],
            "source": [
                "# check the predictive performance before the model is trained\n",
                "print(model.run_eval(test_file)) "
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 8,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "at epoch 1\n",
                        "train info: logloss loss:744.3602104187012\n",
                        "eval info: auc:0.6637, logloss:0.5342\n",
                        "at epoch 1 , train time: 20.7 eval time: 4.0\n",
                        "at epoch 2\n",
                        "train info: logloss loss:385.66929054260254\n",
                        "eval info: auc:0.7137, logloss:0.5109\n",
                        "at epoch 2 , train time: 19.9 eval time: 3.9\n",
                        "at epoch 3\n",
                        "train info: logloss loss:191.5083179473877\n",
                        "eval info: auc:0.7283, logloss:0.5037\n",
                        "at epoch 3 , train time: 19.7 eval time: 4.1\n",
                        "at epoch 4\n",
                        "train info: logloss loss:92.20774817466736\n",
                        "eval info: auc:0.7359, logloss:0.4991\n",
                        "at epoch 4 , train time: 20.1 eval time: 3.9\n",
                        "at epoch 5\n",
                        "train info: logloss loss:43.15945792198181\n",
                        "eval info: auc:0.74, logloss:0.4963\n",
                        "at epoch 5 , train time: 20.0 eval time: 3.9\n",
                        "at epoch 6\n",
                        "train info: logloss loss:19.656923294067383\n",
                        "eval info: auc:0.7426, logloss:0.4946\n",
                        "at epoch 6 , train time: 20.3 eval time: 3.9\n",
                        "at epoch 7\n",
                        "train info: logloss loss:8.770357608795166\n",
                        "eval info: auc:0.7441, logloss:0.4934\n",
                        "at epoch 7 , train time: 19.9 eval time: 4.0\n",
                        "at epoch 8\n",
                        "train info: logloss loss:3.9227356910705566\n",
                        "eval info: auc:0.7453, logloss:0.4925\n",
                        "at epoch 8 , train time: 19.8 eval time: 4.1\n",
                        "at epoch 9\n",
                        "train info: logloss loss:1.859877161681652\n",
                        "eval info: auc:0.7462, logloss:0.4917\n",
                        "at epoch 9 , train time: 20.2 eval time: 4.0\n",
                        "at epoch 10\n",
                        "train info: logloss loss:1.0249397866427898\n",
                        "eval info: auc:0.747, logloss:0.491\n",
                        "at epoch 10 , train time: 20.2 eval time: 4.0\n",
                        "CPU times: user 3min 57s, sys: 8.46 s, total: 4min 5s\n",
                        "Wall time: 4min\n"
                    ]
                },
                {
                    "data": {
                        "text/plain": [
                            "<recommenders.models.deeprec.models.xDeepFM.XDeepFMModel at 0x7f861579ebd0>"
                        ]
                    },
                    "execution_count": 8,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "%%time\n",
                "model.fit(train_file, valid_file)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 9,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "{'auc': 0.7356, 'logloss': 0.5017}\n"
                    ]
                }
            ],
            "source": [
                "# check the predictive performance after the model is trained\n",
                "result = model.run_eval(test_file)\n",
                "print(result)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 10,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "application/scrapbook.scrap.json+json": {
                            "data": {
                                "auc": 0.7356,
                                "logloss": 0.5017
                            },
                            "encoder": "json",
                            "name": "result",
                            "version": 1
                        }
                    },
                    "metadata": {
                        "scrapbook": {
                            "data": true,
                            "display": false,
                            "name": "result"
                        }
                    },
                    "output_type": "display_data"
                }
            ],
            "source": [
                "# Record results for tests - ignore this cell\n",
                "store_metadata(\"auc\", result[\"auc\"])\n",
                "store_metadata(\"logloss\", result[\"logloss\"])"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 11,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Cleanup\n",
                "tmpdir.cleanup()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Reference\n",
                "\\[1\\] Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., & Sun, G. (2018). xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery \\& Data Mining, KDD 2018, London, UK, August 19-23, 2018.<br>"
            ]
        }
    ],
    "metadata": {
        "celltoolbar": "Tags",
        "interpreter": {
            "hash": "3a9a0c422ff9f08d62211b9648017c63b0a26d2c935edc37ebb8453675d13bb5"
        },
        "kernelspec": {
            "display_name": "reco_gpu",
            "language": "python",
            "name": "conda-env-reco_gpu-py"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.7.13"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}
